QUANTITATIVE PREDICTION METHOD

[...] HIV-1. More specifically, the invention provides methods for predicting [...]
based upon the genotype of the disease affecting the patient.
BACKGROUND 

Techniques to determine the resistance of HIV-1 to a therapeutic agent are becoming
[...]. Many patients experience [...] owing to the virus developing a resistance to
[...].

The various different anti-HIV-1 agents that have been developed over the years were
initially administered to patients alone, as monotherapy. [...] it was observed that all
the compounds lost their effectiveness over time [...] the development of resistance of
the virus to the drug ([...] et al., 1989, Science 246, 1155-8); this is largely due to
the ability of HIV continuously to generate a number of genetic variants in a replicating
viral population. These genetic changes generally alter the configuration of the HIV
[...] susceptible to inhibition by compounds developed to target them. If antiviral
therapy is ongoing and if viral replication is not completely suppressed, the selection
of genetic variants is inevitable and the viral population becomes resistant to the drug.

Since then, combination therapy, using drugs that target both HIV reverse
transcriptase (RT) and protease (PR) molecules, has provided increased control of viral
replication and has provided extended clinical benefit to patients [...].
However, it has become clear that even patients being treated with [...]

Since patients in the developed world are generally prescribed cocktails of therapeutic
drugs, not all HIV-1 infections originate with a wild-type, drug-sensitive strain from
which drug resistance will emerge: with the increase in prevalence of drug-resistant
strains comes an increase in infections that actually begin with drug-resistant strains.
Infections with pre-existing drug resistance immediately reduce the options for
drug treatment and emphasize the importance of drug resistance information to
optimize initial therapy for these patients.

Moreover, as the number of available antiretroviral agents has increased, so has the
number of possible drug combinations and combination therapies. It is therefore very
difficult, if not impossible, for the physician to establish the optimal combination for an
individual. Although there are many drugs available for use in combination therapy, the
choices can quickly be exhausted and the patient can rapidly experience clinical
progression or deterioration if the wrong treatment decisions are made. The key to
tailored, individualized therapy lies in the effective profiling of the individual patient's
virus population in terms of sensitivity or resistance to the available drugs. This
requires the advent of truly individualized therapy.

There are certain solutions to this problem currently in use.

Phenotyping directly measures the actual sensitivity of a patient's pathogen or
malignant cell to particular therapeutic agents. However, this can be slow,
labour-intensive and thus expensive.

A second approach to measuring resistance involves genotyping tests that detect
specific genetic changes (mutations) in the viral genome which lead to amino acid
changes in at least one of the viral proteins, known or suspected to be associated with
resistance. Although genotyping tests can be performed more rapidly, a problem with
genotyping is that there are now over 100 individual mutations with evidence of an
effect on susceptibility to HIV-1 drugs, and new ones are constantly being discovered
in parallel with the development of new drugs and treatment strategies. The relationship
between these point mutations, deletions and insertions and the actual susceptibility of
the virus to drug therapy is extremely complex and interactive. An example of this
complexity is the M184V mutation, which confers resistance to 3TC but reverses AZT
resistance. The 333D/E mutation, however, reverses this effect and can lead to dual
AZT/3TC resistance.

Sophisticated interpretation is therefore required to predict what the net effect of these
mutations might be on the susceptibility of the virus population to the various
therapeutic agents. Rules-based computer algorithms have provided some assistance;
for example, see International patent application WO 01/79540. An overview of this
type of technique is presented in Figure 1 (figure 1 from poster). However, there
remains a continuing need for the quantitative prediction of HIV drug susceptibility
from viral genotype. Furthermore, because the majority of HIV patients have now been
exposed to drug cocktails, it is thought that the disease-causing retroviruses tend to
spontaneously generate mutations that have often co-evolved. This makes the analysis
of which mutations are responsible for resistance to which drugs almost impossible
using currently available techniques. It also means that mutations that contribute to
resistance are being overlooked using the currently available analysis techniques.

It is therefore an aim of the present invention to provide methods for improving the
interpretation of genotypic results.

It is a further aim of the invention to provide methods for determining (or predicting) a
phenotype based on a genotype.

It is also a further aim of the invention to provide methods for predicting the resistance
of an HIV variant of a particular genotype to a therapy or a therapeutic agent.

It is also an aim of the invention to predict resistance of a patient to therapy.

It is also an aim of the invention to provide methods to assess the effectiveness or
efficiency of a therapy or to optimize a patient's therapy.

It is also an aim of the invention to identify novel HIV-1 mutations that are associated
with resistance to particular drug therapies or combination therapies.

SUMMARY OF THE INVENTION

A solution to these problems involves new methods for measuring drug resistance by
correlating genotypic information with phenotypic drug resistance profiles measured
experimentally.

According to a first aspect of the invention, there is provided a method for quantitating
the individual contribution of a mutation or combination of mutations to the drug
resistance phenotype exhibited by HIV, said method comprising the steps of:

a) performing a linear regression analysis using data from a dataset of matching
genotypes and phenotypes, whereby the log fold resistance, pFR, is modelled as the
sum of all the individual resistance contributions for each of the mutations or
combinations of mutations that occur in HIV according to the following equation:

$$pFR = \beta_A M_A + \beta_B M_B + \dots + \beta_Z M_Z + \varepsilon$$

wherein each individual resistance contribution is calculated by multiplying a mutation
factor, $M_A$, $M_B$, ..., $M_Z$, for each mutation or combination of mutations by a
resistance coefficient $\beta_A$, $\beta_B$, ..., $\beta_Z$;

wherein the mutation factor assigned to each mutation or combination of mutations
reflects the degree to which that mutation or combination of mutations is present in the
HIV strain and, if present, to which degree the mutation is present in a mixture;

wherein each resistance coefficient reflects the contribution of the mutation or
combination of mutations to the fold resistance exhibited by the strain;

and wherein the error term, $\varepsilon$, represents the difference between the modelled
prediction and the experimentally determined measurement.

This method involves a data-driven technique for quantitative drug susceptibility
prediction. This method uses a multiple linear regression model to estimate coefficient
values that accurately reflect the contribution made by a particular HIV mutation or
combination of mutations to resistance to a particular drug. Repeating the method for
each candidate therapeutic drug allows the compilation of a global picture of drug
resistance exhibited by a particular HIV strain.

This method has allowed the identification of mutations hitherto unrecognised as
having an effect on drug resistance in HIV. The method also allows the identification of
primary (single mutations) and secondary (the co-occurrence of two mutations) or
higher order terms of resistance-associated mutations for new and existing drugs.
Accordingly, a further aspect of the invention provides a method of identifying a
mutation that affects the degree of drug resistance exhibited by an HIV strain using a
method according to the first aspect of the invention.

The method of the first aspect of the invention is also advantageous over current
methods since it allows the quantitative, purely data-driven, objective assessment of the
contribution of mutations and combinations of mutations to drug resistance. The
method also allows the deconvolution of the individual contribution made by particular
mutations to the drug resistance phenotype. Unlike existing methods, the method is
able to correct for correlating mutations that on the face of it appear to affect drug
resistance, but which in fact only correlate in their occurrence with resistance-causing
mutations and are themselves phenotypically silent.

The method has allowed the design of an automated computational technique for the
prediction of the drug resistance profile possessed by a particular HIV strain [...]
patient. The methods thus allow the determination of a patient phenotype without
having to perform any phenotypic testing whatsoever. This has clear ramifications for
the bespoke design, optimization and assessment of strategies for individual patient
therapy based upon the genotype of the infecting agent.

The invention also provides diagnostic kits for performing each of the methods of the
invention described herein.

In any population of HIV variants, there is a wide distribution of drug resistance
phenotypes for any particular drug, ranging from hyper-susceptibility to strong
resistance (see Figure 2 (figure 2 from poster)). The expression "drug resistance
phenotype" means the resistance of an HIV virus to a tested therapy, therapeutic agent
or drug. The term "resistance" as used herein pertains to the capacity of resistance,
sensitivity, susceptibility, or effectiveness of a therapy against a disease. The term
"therapy" includes but is not limited to a drug, pharmaceutical, or any other compound
or combination of compounds that can be used in therapy or therapeutic treatment of
HIV. This distribution of drug resistance reflects the large number of different
genotypes that are present in the population. Some variants may only have one
mutation that is correlated with drug resistance, whilst others will have several or
numerous such mutations, each of which may impart its own contribution to the drug
resistance phenotype.

Adding an additional level of complication are the phenomena of antagonism, synergy
and enhancement, where certain mutations may add to or detract from the effect of
other mutations in a manner not predictable from studying the effects of the individual
mutations alone. Highly correlated mutations are also problematic. These are mutations
that almost always co-occur in a strain, but only one of the mutations actually has an
effect on drug resistance. For example, when one of these two mutations has an effect on
resistance and the other mutation does not (this mutation might for example be highly
correlated with the resistance mutation because it affects the replication rate of the
virus), the effect can erroneously be assigned to either one of the mutations.

Examples of mutations known or suspected to influence the sensitivity of HIV to drug
therapy may be found on the internet at http://hiv-web.lanl.gov;
http://hivdb.stanford.edu/hiv/; or http://www.viral-resistance.com.

In HIV, two sections of the genome are generally studied: Protease (PR) and Reverse
Transcriptase (RT). The methods of the present invention can equally be applied to
other sections of the HIV genome such as integrase (IN). A mutation is presented as a
number referring to the position in the protein, followed by the amino acid(s) at that
position if it differs from the amino acid in the HXB2 HIV reference. In the terms
included above, the mutations are represented as "A", "B", ..., "Z".

Mixtures reflect the diversity of the HIV population in a sample. A mixture means that
at that position two subsets of the population have a different amino acid. Mixtures are
denoted by separating amino acids with the '/' character: 65K/R (mixture of 'K' and
'R' at position 65).

When more than two amino acids are found at a position in subsets of the
population, the dummy amino acid 'X' is used.

Insertions are denoted by suffixing the insert position behind the position: 69.2S
(insertion of 'S' at insert position 2). Deletions are denoted by a minus sign: 69-.
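
As an illustration only, the notation described above (a position number, an optional insert position after a '.', then either one or more amino acids separated by '/' or a '-' for a deletion) can be parsed with a short routine. The `Mutation` record and `parse_mutation` function are hypothetical names introduced for this sketch; they are not part of the described method.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Mutation(NamedTuple):
    """Hypothetical record for one mutation code such as 65K/R, 69.2S or 69-."""
    position: int
    insert_index: Optional[int]    # 2 for "69.2S", None when there is no insertion
    amino_acids: Tuple[str, ...]   # more than one entry denotes a mixture
    deletion: bool


# position, optional ".insert", then "-" (deletion) or amino acids separated by "/"
_CODE = re.compile(r"^(\d+)(?:\.(\d+))?(-|[A-Z](?:/[A-Z])*)$")


def parse_mutation(code: str) -> Mutation:
    match = _CODE.match(code.strip())
    if match is None:
        raise ValueError(f"unrecognised mutation code: {code!r}")
    position, insert, residue = match.groups()
    deletion = residue == "-"
    amino_acids = () if deletion else tuple(residue.split("/"))
    return Mutation(int(position), int(insert) if insert else None,
                    amino_acids, deletion)


print(parse_mutation("65K/R"))
print(parse_mutation("69.2S"))
print(parse_mutation("69-"))
```

The dummy amino acid 'X' needs no special handling here, since it is matched like any other single-letter code.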

Examples of mutations present in the RT domain of HIV conferring resistance to a
reverse transcriptase inhibitor include 69C, 69V, 69T, 75A, 101I, 103T, 103N, 184T,
[...]. Additional examples of [...] of HIV conferring resistance to a reverse
transcriptase inhibitor include 24M, 48A, and 53L. A mutation may affect resistance
alone or in combination with other mutations.

For the purposes of the invention, the mutations identified should be associated with
resistance or susceptibility to drug therapy, for example an antiretroviral drug. The
degree to which a particular mutation pattern may affect resistance may be determined
[...] such as the ANTIVIROGRAM® (Virco, Belgium) (see WO97/27480). In this [...]
resistance is determined with respect to a laboratory reference [...] HIV LAI/IIIB. The
difference in IC50 (the concentration of drug required to reduce the [...] by 50%)
between the patient sample and the reference viral strain is determined as a quotient.
This fold change in IC50 is reported and indicative of [...]. Based on the changes in
IC50, cut-off values [...] sensitive or resistant to a [...].

Experiments are underway to compile data relating to the correspondence of certain
[...] with drug resistance phenotype, and these generally lead to the generation of
[...] databases or tables that illustrate the matching [...] for various antiretroviral
drugs. Such databases bring together the knowledge of both a genotypic and a
phenotypic database. A phenotypic [...] contains phenotypic resistance values for HIV
to at least one therapy [...] multiple drug therapies. For example, the phenotypic
resistance values of tested HIV viruses, with a fold resistance determination compared
to the reference HIV virus (wild type).

The dataset used herein is a dataset developed by the Applicant, which consists of a set
of matching genotype / phenotype measurements with possible multiple phenotype
measurements per genotype. However, any similar dataset may be used, provided that
there are sufficient entries for each genotype / phenotype measurement for the data to
be significant. In the Virco dataset, the mutations are defined relative to HXB2 at
amino acid level.

The phenotypes are presented as pFR values, where pFR is equal to -log(FR), where
FR denotes the Fold Resistance. Negative pFR values thus denote resistance and
positive values denote hyper-susceptibility. For example, a pFR value of -1.0 is equal
to 10-fold resistance. An example of the pFR distribution for Saquinavir (SQV) is
shown in Figure 3 (slide 5). Figure 4 (slides 7 and 8) shows the pFR distribution for the
"48V" mutation on SQV. It is clear from this that the 48V subset does not behave the
same as the whole dataset.
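
The relationship between FR and pFR can be sketched in a few lines. The sketch assumes the logarithm is base 10, which is what the "-1.0 is equal to 10-fold resistance" example implies; the function names are illustrative only.

```python
import math


def fold_resistance_to_pfr(fr: float) -> float:
    """pFR = -log10(FR); base 10 is an assumption taken from the 10-fold example."""
    return -math.log10(fr)


def pfr_to_fold_resistance(pfr: float) -> float:
    """Inverse transform: FR = 10 ** (-pFR)."""
    return 10.0 ** (-pfr)


print(pfr_to_fold_resistance(-1.0))   # resistance: FR greater than 1
print(fold_resistance_to_pfr(0.5))    # FR below 1 gives a positive pFR (hyper-susceptible)
```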

The problems of unwanted correlations between mutations, where not all correlated
mutations contribute to the drug resistance phenotype, are illustrated in Figure 5 (slide
9). Here, the left hand panel shows the pFR distribution for the 71I mutation. When the
effects of mutations 48V and 84V are removed (right hand panel), the pFR distribution
is markedly increased (less drug resistance).

According to the invention, the predicted fold resistance of an HIV strain of a particular
genotype may be calculated by summing the individual resistance contributions for
each of the mutations or combinations of mutations in the mutation pattern of that
genotype. The method uses linear regression models, so that the phenotype prediction,
pFR, is calculated in the following equation (1):

$$pFR = \beta_A M_A + \beta_B M_B + \dots + \beta_Z M_Z + \varepsilon \qquad (1)$$



The independent variables $M_A$, $M_B$, ..., $M_Z$ are referred to herein as mutation
factors, each of which reflects the degree to which the mutation or combination of
mutations is present in the HIV strain and, if present, whether or not the mutation is
present in a mixture.

The resistance coefficients $\beta_A$, $\beta_B$, ..., $\beta_Z$ represent the contribution
to the total pFR prediction for each single mutation.

Each mutation factor $M_i$ thus represents the presence or absence of the corresponding
mutation and each coefficient $\beta_i$ represents the contribution to the pFR change for
that specific mutation.

The mutation factor may take into account 1st order terms (single mutations) as well as
2nd order terms (the co-occurrence of two mutations) and in general nth order terms. For
example, 2nd order terms take the form:

$$\beta_{AB} M_{AB}$$

The independent variable $M_{AB}$ represents the co-occurrence of mutations A and B and
the coefficient $\beta_{AB}$ represents the synergy or antagonism between mutations A and B.

Higher order terms account for interactions between mutations:

• reversal or antagonism: positive pFR shift for a mutation couple.

• synergy or enhancement: extra negative pFR shift for a mutation couple.

For example, consider the following (artificial) model:

Mutation    Coefficient
A           -0.46
B           -0.92
C           -0.64
D            0.63
E           -0.16
F           -0.19
F&A         -0.09

Consider a virus with the following mutations: F, A and E. Applying equation (1), this
virus will have a pFR prediction:

$$pFR = \beta_F + \beta_A + \beta_E + \beta_{F\&A} = -0.19 - 0.46 - 0.16 - 0.09 = -0.90$$

or almost 8-fold resistance.

Note that in the model F and A are synergistic since their co-occurrence decreases the
pFR by an extra -0.09.
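
The arithmetic of this artificial example can be checked with a minimal sketch. It assumes all mutation factors are 1 (no mixtures) and a base-10 logarithm for pFR; the dictionary layout and the name `predict_pfr` are illustrative choices, not part of the described method.

```python
# Coefficients of the artificial model; keys are tuples so that
# single mutations and mutation couples share one structure.
COEFFICIENTS = {
    ("A",): -0.46,
    ("B",): -0.92,
    ("C",): -0.64,
    ("D",): 0.63,
    ("E",): -0.16,
    ("F",): -0.19,
    ("A", "F"): -0.09,   # second order term F&A
}


def predict_pfr(mutations, coefficients=COEFFICIENTS):
    """Sum the coefficient of every term whose mutations are all present."""
    present = set(mutations)
    return sum(beta for term, beta in coefficients.items()
               if all(m in present for m in term))


pfr = predict_pfr(["F", "A", "E"])
fold = 10 ** (-pfr)
print(round(pfr, 2), round(fold, 2))   # -0.9 7.94
```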
The error term is $\varepsilon$, which is the difference between the prediction and the
measurement. This error term contains both the measurement error on the phenotype
measurement and a model error (if the underlying model has higher order terms that
are not taken into account in the regression model).

Mutation factors for single mutations ($M_A$, $M_B$, ..., $M_Z$) are calculated as follows:

if the mutation is present in the HIV strain, a positive mutation factor is assigned;

if the single mutation is not present, the mutation factor assigned is zero;

if the single mutation is present in a mixture, an averaged positive mutation factor is
assigned.

Conveniently, mutation factors range between 0 and 1, where 0 means not present and 1
means present. Values between 0 and 1 mean that the mutation is present in a mixture.
Accordingly, a positive mutation factor is assigned the value 1.

Mixtures are modelled as causing the average shift of their constituent mutations. Since
methods for the quantitation of the precise proportions of mixtures to wild type are
expensive and time-consuming, a mixture with wild type may conveniently be treated as
causing half the pFR shift of the resistance mutation (mutation factor = 0.5). However,
as the skilled reader will appreciate, a more precise mutation factor may be assigned if
the true proportion in the mixture is known.

Mutation factors for double mutations ($M_{AB}$ etc.) are calculated as follows:

if both the mutations are present in the HIV strain, a positive mutation factor is
assigned (conveniently, the value 1);

if neither of the mutations is present, the mutation factor assigned is zero;

if both mutations are present and one mutation is present in a mixture, an averaged
positive mutation factor is assigned (conveniently, 0.5);

if both mutations are present in a mixture, a reduced averaged positive mutation factor
is assigned (in this example, 0.25). The factor 0.25 is the product of the M-factors of
both the single constituent mutations. This is the result of the assumption that these
mixtures are independent of each other. Of course, this is an approximation, since in a
real blood sample the mixtures are not independent of each other. For example, if only
2 viruses were present, virus A (no mutations) for 70% and virus B (mutations 46I and
84V) for 30%, then a mixture would be detected on both positions 46 and 84. If these
concentrations were known, it would be possible to fine tune the mutation factor of
0.25. If this information is not available, the best statistical guess is 0.5*0.5, this being
the average value that would be measured for the mutation couple being present for a
population of samples that have these mixtures on 46 and 84 in all possible
concentrations.
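
The rules above for mutation factors can be sketched as two small functions. This is only an illustration under the stated convention (1 for present, 0 for absent, 0.5 for a mixture with wild type unless the true proportion is known); the function names and the `proportion` parameter are assumptions made for this sketch.

```python
def single_factor(present: bool, in_mixture: bool = False,
                  proportion: float = 0.5) -> float:
    """Mutation factor for one mutation.

    `proportion` is the assumed share of the mutant in a mixture; 0.5 is the
    convenient default when the true proportion has not been measured.
    """
    if not present:
        return 0.0
    return proportion if in_mixture else 1.0


def pair_factor(factor_a: float, factor_b: float) -> float:
    """Mutation factor for a couple: the product of the two single factors,
    which treats the two mixtures as independent of each other."""
    return factor_a * factor_b


pure = single_factor(True)
mixed = single_factor(True, in_mixture=True)
print(pair_factor(pure, pure))    # both present, no mixture
print(pair_factor(pure, mixed))   # one mutation in a mixture
print(pair_factor(mixed, mixed))  # both mutations in a mixture
```

If the true proportions were known, as in the 70%/30% example, they could be passed through `proportion` instead of the default, although the independence assumption behind the product would still be only an approximation.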

Calculation of the resistance coefficients ($\beta_A$, ..., $\beta_Z$) is performed by
evaluating the dataset for the drug phenotype reported for each mutation or
combination of mutations.

The problem of unwanted correlations has been discussed above. Unwanted
correlations are preferably removed according to the method of the invention. A
preferred way to do this is to use an algorithm that has been developed by the inventors
to track the change in pFR as the effects of individual mutations or combinations of
mutations are removed from the dataset. The effect of each mutation or combination of
mutations is thus separated out. The methodology follows mutation trajectories towards
the global average as the effects of individual mutations or combinations of mutations
are removed. The steps are as follows:

1. Calculate the average pFR for all mutations with a sufficient count in the database
to be significant;

2. Determine the extremes (maximum, minimum), and select the mutation with the pFR
furthest away from the global average;

3. Remove all virus strains that have the selected mutation and reiterate from step 1;

4. Stop when the selected mutation in step 2 has an average pFR that approximates to
the global average.

In this manner, mutations that do not cause resistance, but which are often present with
mutations that do cause resistance, will have a higher average pFR (less resistance).
Removing the virus strains with a certain resistance-causing mutation results in an
increase of the average pFR for correlating mutations.

A suitable threshold at which a count in the database becomes sufficiently significant
will be apparent to the skilled reader and will be dependent on the database size. For
example, thresholds of 5, 10, 15, 20, 25, 30 or more may be suitable. In the examples
discussed herein, a threshold of 20 times was used.

By an "average pFR that approximates to the global average" is meant that the average
pFR is within a fraction of the standard deviation of the remaining population. A
convenient fraction ranges between about 0.3 and 0.5.
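
The mutation trajectories steps can be sketched as follows. This is a minimal illustration, not the inventors' implementation: it assumes each sample is a pair of (set of mutation codes, pFR), uses the population standard deviation of the remaining samples for the stopping rule, and uses a tiny toy dataset with a count threshold of 2 purely for demonstration.

```python
from statistics import mean, pstdev


def mutation_trajectories(samples, min_count=20, fraction=0.5):
    """Return the mutations selected in turn, with their average pFR at selection."""
    remaining = list(samples)
    selected = []
    while remaining:
        values = [pfr for _, pfr in remaining]
        global_avg, spread = mean(values), pstdev(values)
        # Step 1: average pFR per mutation with a sufficient count.
        by_mutation = {}
        for mutations, pfr in remaining:
            for m in mutations:
                by_mutation.setdefault(m, []).append(pfr)
        averages = {m: mean(v) for m, v in by_mutation.items()
                    if len(v) >= min_count}
        if not averages:
            break
        # Step 2: the mutation whose average is furthest from the global average.
        best = max(averages, key=lambda m: abs(averages[m] - global_avg))
        # Step 4: stop once that average approximates the global average.
        if abs(averages[best] - global_avg) <= fraction * spread:
            break
        selected.append((best, averages[best]))
        # Step 3: remove all strains carrying the selected mutation, then repeat.
        remaining = [(ms, p) for ms, p in remaining if best not in ms]
    return selected


# Toy data: 48V lowers pFR by 1.5; 71I merely co-occurs with 48V.
toy = ([({"48V", "71I"}, -1.5)] * 3 + [({"48V"}, -1.5)]
       + [({"71I"}, 0.0)] * 2 + [(set(), 0.0)] * 4)
result = mutation_trajectories(toy, min_count=2)
print(result)
```

In the toy data the average pFR for 71I is -0.9 before 48V is removed and 0.0 afterwards, which is the jump towards the global average that the text describes for correlating but silent mutations.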

A comparison of the change in the global average pFR with the change in the average
pFR for selected mutations with increasing iterations of the algorithm is shown in
Figure 6a. Figure 6b shows an example where the average pFR for 71I (unwanted
correlation) jumps up as a result of removing from the dataset virus strains that have
"71I & 84V" and "48V" mutations.

An alternative, analogous methodology for removing unwanted correlations is as
follows; this is an extension of the mutation trajectories algorithm discussed above. The
steps of this method are as follows:

1. Calculate the correlation coefficient between all mutations (with a sufficient count
in the database) and the pFR;

2. Determine the extremes (maximum, minimum), and select the mutation with the
highest (absolute value of) correlation coefficient;

3. Calculate a linear model for the pFR with the selected mutation(s) (from step 2,
all previous iterations);

4. Take the residue (pFR minus the predicted value from the model);

5. Calculate the correlation coefficient between all mutations (with a sufficient count
in the database) and the residue;

6. Determine the extremes (maximum, minimum), and select the mutation with the
highest (absolute value of) correlation coefficient;

7. Calculate a linear model for the pFR with the selected mutation(s) (from step 6,
all previous iterations);

8. Reiterate from step 4; and

9. Stop when the mutation selected in step 6 has a correlation coefficient that
approximates to zero.
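
The correlation-based steps above can be sketched as a forward selection loop. This is a hedged illustration only: it assumes mutation factors are given as one numeric column per mutation, fits the linear model by ordinary least squares through the normal equations (a choice made here for self-containment, not stated in the text), and treats a correlation below `tol` as "approximately zero". All names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is (numerically) constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx < 1e-18 or syy < 1e-18:
        return 0.0
    return sxy / sqrt(sxx * syy)


def fit_linear(columns, y):
    """Least squares with intercept (normal equations, Gauss-Jordan elimination).
    Returns the fitted values. Assumes the columns are not collinear."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [u - f * v for u, v in zip(a[r], a[i])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    return [sum(b * x for b, x in zip(beta, r)) for r in rows]


def stepwise_selection(indicators, pfr, min_count=20, tol=0.05):
    """Select mutations by their correlation with the current residue."""
    candidates = {m: col for m, col in indicators.items()
                  if sum(1 for v in col if v > 0) >= min_count}
    selected, residue = [], list(pfr)
    while True:
        corr = {m: pearson(col, residue)
                for m, col in candidates.items() if m not in selected}
        if not corr:
            break
        best = max(corr, key=lambda m: abs(corr[m]))
        if abs(corr[best]) <= tol:      # correlation approximates to zero: stop
            break
        selected.append(best)
        predicted = fit_linear([candidates[m] for m in selected], pfr)
        residue = [y - p for y, p in zip(pfr, predicted)]
    return selected


# Toy data: R shifts pFR by -1.0; C often co-occurs with R but is silent.
toy_indicators = {"R": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                  "C": [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]}
toy_pfr = [-1.0, -1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
chosen = stepwise_selection(toy_indicators, toy_pfr, min_count=2)
print(chosen)
```

In the toy data C correlates with the raw pFR only through R, so once R is in the model the residue carries no signal and C is never selected.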

As with the mutation trajectories algorithm described above, the effect of mutations
that do not themselves cause resistance, but which are often present with mutations that
do cause resistance, is excluded and thus does not distort the real values.

In further preferred embodiments of the invention, problems of small datasets for
particular mutations or combinations of mutations are dealt with by applying the
method recursively to the set of virus strains that exhibit those particular mutations or
combinations of mutations.

In still further preferred embodiments of the invention, the following additional
correlations are taken into account:

multiple entries of the same virus strain (or virus strains grown from the same stock
solution) that cause unwanted correlations;

censored values in the genotype / phenotype database (for example, EC50 value = '>
1 µM'). These are phenotypes beyond the assay range.

Preferably, censored values are dealt with by attempting to construct a model that is
consistent from extrapolations. Censored values are thus modelled by replacing the
censored value by a maximum likelihood estimation, assuming knowledge of the
standard deviation of the measurement error.

A preferred technique for the generation of a maximum likelihood estimation is as
follows:

1. Use value V as if the censor was "=";
2. Calculate linear regression model;
3. Look at the prediction P from the model:

   ■ P < V - 0.798 (centre of gravity of half Gaussian distribution)
     o Remove value from training data for next iteration

   ■ V - 0.798 < P < V
     o Use V* = V - 0.798

   ■ V < P
     o Use V* = centre of gravity of tail (< V) of a normal distribution N(P, σ) as
       value for next iteration.

Accordingly, for each iteration, when the prediction and measurement contradict,
censored values are taken into account. When the prediction and measurement are
consistent, censored values are disregarded, on the basis that no further information is
provided and their inclusion has no worth.
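
The three cases for a censored value can be sketched as a single update function. The sketch assumes the censored phenotype lies below V on the pFR scale and that the measurement error is Gaussian with known standard deviation `sigma`; the constant 0.798 is taken to be sqrt(2/π), the mean of a half Gaussian with unit standard deviation, and is scaled by `sigma` here as an assumption. The tail mean uses the standard truncated-normal formula.

```python
import math

HALF_GAUSSIAN_MEAN = math.sqrt(2.0 / math.pi)   # about 0.798 for unit sigma


def _pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def update_censored(v: float, p: float, sigma: float = 1.0):
    """Replacement value V* for a censored measurement V given prediction P.

    Returns None when the value should be dropped from the next iteration.
    """
    offset = HALF_GAUSSIAN_MEAN * sigma
    if p < v - offset:
        return None                      # prediction already consistent: drop
    if p < v:
        return v - offset                # V* = V - 0.798 (for sigma = 1)
    # V <= P: mean of the part of N(P, sigma) lying below V
    z = (v - p) / sigma
    return p - sigma * _pdf(z) / _cdf(z)


print(update_censored(-2.0, -3.5))   # None
print(update_censored(-2.0, -2.5))   # about -2.798
print(update_censored(-2.0, -1.0))   # about -2.525
```

The second and third cases agree at P = V, so the replacement value changes continuously as the prediction crosses the censoring limit.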

In one preferred embodiment of this aspect of the invention, the number of calculations
necessary in the linear regression analysis may be reduced. The computational power
and memory that are currently generally available are insufficient to allow a
full second order model to be evaluated for a large dataset, based on all possible single
mutations and second order terms, since the number of terms increases quadratically
with the number of mutations considered. This number increases with a larger dataset,
since more rare mutations are present in a large database.

In order to reduce the number of terms, a first order regression may be performed from
the list of mutations that occur in the dataset above a threshold number of times. A
suitable threshold at which a count in the database becomes sufficiently significant will
be apparent to the skilled reader and will be dependent on the database size. For
example, thresholds of 5, 10, 15, 20, 25, 30 or more may be suitable. In the examples
discussed herein, a threshold of 20 times was used. The significant terms from this first
order regression are withheld and the list of these terms is then used to perform a
second order regression. In the second order regression, only the single mutations and
combinations of mutations that were found significant in the first order model are used.
Again, a threshold significance will be apparent to the skilled reader; an example is if
the probability that the real value of the term is 0 is smaller than 0.001.

For example, a first order regression performed on the matching genotype / phenotype
dataset for Indinavir (34,445 measurements) for those mutations that occur at least 20
times results in a first order model that withholds a list of 94 single mutations that are
considered significant.

This list is then used as a starting list for a second order regression. It should be noted
that it may be advantageous to exclude certain very common mutations from the
calculation. 3I is one example. The reason is that a mutation must occur at least a
threshold number of times and the inverse also has to be true: the count of viruses not
having the mutation 3I, or the couple "not 3I" and another mutation, should also be above
the threshold value (e.g. 20). Taking this into account results in excluding 3I from the
regression in practice.

In the second order regression, all the single mutations and all couples of mutations
from the list are used as potential terms. The significant terms are then withheld by the
regression algorithm.
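
How the candidate terms for the second order regression might be assembled can be sketched as below. This covers only the counting rule (a term must be present, and also absent, in at least a threshold number of strains); the significance test of the regression itself is not shown. The representation of strains as sets of mutation codes and the function names are assumptions for this sketch.

```python
from itertools import combinations


def usable(count: int, total: int, threshold: int = 20) -> bool:
    """A term is usable when it occurs, and is absent, at least `threshold` times."""
    return count >= threshold and total - count >= threshold


def second_order_terms(strains, significant, threshold=20):
    """Return (singles, couples) to offer to the second order regression."""
    total = len(strains)
    singles = [m for m in significant
               if usable(sum(m in s for s in strains), total, threshold)]
    couples = [(a, b) for a, b in combinations(singles, 2)
               if usable(sum(a in s and b in s for s in strains), total, threshold)]
    return singles, couples


# Toy data: 3I is present in every strain, so it can never be absent often enough.
toy_strains = ([{"3I", "48V", "84V"}] * 3 + [{"3I", "48V"}] * 2
               + [{"3I", "84V"}] + [{"3I"}] * 4)
singles, couples = second_order_terms(toy_strains, ["3I", "48V", "84V"], threshold=2)
print(singles, couples)
```

With n retained single mutations there are at most n(n-1)/2 couples, which is the quadratic growth that motivates filtering the list before the second order regression.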

According to a further aspect of the invention, there is provided a method of calculating
the quantitative contribution of a mutation pattern to the drug resistance phenotype
exhibited by an HIV strain, said method comprising the steps of:

a) obtaining a genetic sequence of said HIV strain,

b) identifying the pattern of mutations in said genetic sequence, wherein said mutations
are associated with resistance or susceptibility to drug therapy, and

c) calculating the fold resistance of the HIV strain as compared to the wild type HIV
strain by performing a linear regression analysis, whereby the log fold resistance, pFR,
is modelled as the sum of all the individual resistance contributions for each of the
mutations or combinations of mutations that occur in said HIV strain according to the
following equation:

$$pFR = \beta_A M_A + \beta_B M_B + \dots + \beta_Z M_Z + \varepsilon$$

wherein each individual resistance contribution is calculated by multiplying a mutation
factor, $M_A$, ..., $M_Z$, for each mutation or combination of mutations by a
resistance coefficient $\beta_A$, $\beta_B$, ..., $\beta_Z$;

wherein the mutation factor assigned reflects the degree to which the mutation or
combination of mutations is present in the HIV strain and, if present, to which degree
the mutation is present in a mixture;

wherein each resistance coefficient reflects the contribution of the mutation or
combination of mutations to the fold resistance exhibited by the strain;

and wherein the error term represents the difference between the modelled prediction
and the experimentally determined measurement.

As the skilled reader will appreciate, the fold resistance of the HIV strain may be
calculated using any one of the embodiments of the invention referred to above.

In the first step of this method, the genetic sequence of an HIV strain should be
obtained. Normally, this will be the genetic sequence of an HIV strain with which a
patient is infected, although the sequence may be a theoretical sequence, for example
for purposes of in silico modelling.

The method may thus be used as a diagnostic method for predicting the fold resistance
exhibited by a particular HIV strain with which a patient is infected. According to other
preferred embodiments, the method may be used for assessing the efficiency of a
patient's therapy or for evaluating or optimizing a therapy. The method may be
performed for each drug or combination of drugs currently being administered to the
patient so as to obtain a series of drug resistance phenotypes and thus to assess the
effect of a plurality of drugs or drug combinations on the predicted fold resistance
exhibited by the HIV strain with which the patient is infected.

A "patient" may be any organism, particularly a human or other mammal, suffering
from HIV or AIDS or in need or desire of treatment for such disease. A patient includes
any mammal and particularly humans of any age or state of development.

To obtain an HIV strain from a patient, a biological sample will need to be obtained
from the patient. A "biological sample" may be any material obtained in a direct or
indirect way from a patient containing HIV virus. A biological sample may be obtained
from, for example, saliva, semen, breast milk, blood, plasma, faeces, urine, tissue
samples, mucous samples, cells in cell culture, cells which may be further cultured, etc.
Biological samples also include biopsy samples.

The genetic sequence of an HIV strain may be evaluated by a number of suitable
means, as will be clear to those of skill in the art. Most suitable will be techniques that
allow for specific nucleic acid amplification, such as the polymerase chain reaction
(PCR), although other techniques such as restriction fragment length polymorphism
(RFLP) analysis will be equally applicable.

Nucleic acid sequencing then allows the analysis of the mutation pattern in a particular
nucleic acid sequence, either by classical nucleic acid sequencing protocols, e.g. extension
chain termination protocols (Sanger technique; see Sanger F, Nicklen S, Coulson A.
Proc. Nat. Acad. Sci. 1977, 74, 5463-5467) or chain cleavage protocols. Such methods
may employ such enzymes as the Klenow fragment of DNA polymerase I, Sequenase
(US Biochemical Corp, Cleveland, OH), Taq polymerase (Perkin Elmer), thermostable
T7 polymerase (Amersham, Chicago, IL), or combinations of polymerases and
proofreading exonucleases such as those found in the ELONGASE Amplification System
marketed by Gibco/BRL (Gaithersburg, MD). Preferably, the sequencing process may
be automated using machines such as the Hamilton Micro Lab 2200 (Hamilton, Reno,
NV), the Peltier Thermal Cycler (PTC200; MJ Research, Watertown, MA) and the ABI
Catalyst and 373 and 377 DNA Sequencers (Perkin Elmer). Particular sequencing
methodologies have been developed further by companies such as Visible Genetics.
Any of the novel approaches developed for unravelling the sequence of a target nucleic
acid, either now or in the future, will be perfectly applicable to the analysis of sequence
in the present invention, including but not limited to mass spectrometry, MALDI-TOF
(matrix assisted laser desorption ionization time of flight spectroscopy) (see Graber J,
Smith C, Cantor C. Genet Anal. 1999, 14, 215-219) and chip analysis (hybridization based
techniques) (Fodor S P; Rava R P; Huang X C; Pease A C; Holmes C P; Adams C L.
Nature 1993, 364, 555-6). It should be appreciated that nucleic acid sequencing covers
both DNA and RNA sequencing.
Once the genetic sequence of the HIV strain is known, the pattern of mutation must be
identified in the sequence. The term "mutation", as used herein, encompasses
both genetic and epigenetic mutations of the genetic sequence of wild type HIV. A
genetic mutation includes, but is not limited to, (i) base substitutions: single nucleotide
polymorphisms, transitions, transversions, substitutions and (ii) frame shift mutations:
insertions, repeats and deletions. Epigenetic mutations include, but are not limited to,
alterations of nucleic acids, e.g., methylation of nucleic acids. One example includes
(changes in) methylation of cytosine residues in the whole or only part of the genetic
sequence. In the present invention, mutations will generally be considered at the level
of the amino acid sequence, and comprise, but are not limited to, substitutions,
deletions or insertions of amino acids.



The "control sequence" or "wild type" is the reference sequence against which the
existence of mutations is determined. A control sequence for HIV is HXB2. This viral
genome comprises 9718 bp and has accession number M38432 or K03455 (gi number:
327742) in GenBank at NCBI.

Identifying a mutation pattern in a genetic sequence under test thus relates to the
identification of mutations in the genetic sequence, as compared to a wild type
sequence, which lead to a change in nucleic acids or amino acids or which lead to
altered expression of the genetic sequence or altered expression of the protein encoded
by the genetic sequence or altered expression of the protein under control of said
genetic sequence.



A "mutation pattern" comprises at least one mutation influencing sensitivity of HIV to 
a therapy. As such, a mutation pattern may consist of only one single mutation.
Alternatively, a mutation pattern may consist of at least two, at least three, at least four,
at least five, at least six, at least seven, at least eight, at least nine or at least ten or more
mutations. A mutation pattern is thus a list or combination of mutations or a list of
combinations of mutations. A mutation pattern of any particular genetic sequence may
be constructed, for example, by comparing the tested genetic sequence against a wild
type or control sequence. The existence of a mutation or the existence of one of a group
of mutations can then be noted.

One way in which this may be done is by aligning the genetic sequence under test to a
wild type sequence, noting any differences in the alignment. Typical alignment methods
include Smith-Waterman (Smith and Waterman (1981) J Mol Biol, 147: 195-197),
BLAST (Altschul et al. (1990) J Mol Biol, 215(3): 403-10), FASTA (Pearson & Lipman
(1988) Proc Natl Acad Sci USA, 85(8): 2444-8) and, more recently, PSI-BLAST
(Altschul et al. (1997) Nucleic Acids Res, 25(17): 3389-402). It may in some
circumstances be preferable to generate alignments using a multiple alignment
program, such as ClustalW (Thompson et al., 1994, NAR, 22(22), 4673-4680). Other
suitable methods will be clear to those of skill in the art (see also "Bioinformatics: A
practical guide to the analysis of genes and proteins" Eds. Baxevanis and Ouellette,
1998, John Wiley and Sons, New York). A practical example of multiple sequence
alignment is the construction of a phylogenetic tree. A phylogenetic tree visualizes the
relationship between different sequences and can be used to predict future events and
retrospectively to infer a common origin. This type of analysis can be used to predict
a similar drug sensitivity for a sample but can also be used to unravel the origin of
different patient samples (i.e. the origin of the viral strain).
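
The comparison against a wild type sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two amino acid sequences are taken to be already aligned (equal length, with '-' marking gaps), the function name `mutation_pattern` and the toy sequences are hypothetical, and mixtures are not handled.

```python
def mutation_pattern(reference: str, test: str, start: int = 1) -> list[str]:
    """List mutations as '<position><amino acid>' relative to the reference.

    Both inputs are assumed to be pre-aligned amino acid strings of equal
    length in which '-' marks a gap (an assumption of this sketch).
    """
    if len(reference) != len(test):
        raise ValueError("aligned sequences must have equal length")
    mutations = []
    position = start - 1
    for ref_aa, test_aa in zip(reference, test):
        if ref_aa == "-":
            # Insertion relative to the reference: the position does not advance.
            mutations.append(f"{position}ins{test_aa}")
            continue
        position += 1
        if test_aa == "-":
            mutations.append(f"{position}del")
        elif test_aa != ref_aa:
            mutations.append(f"{position}{test_aa}")
    return mutations


# Toy fragments for illustration only; they are not taken from the source.
reference = "PQVTLWQRPL"
test = "PQITLWQRPF"
print(mutation_pattern(reference, test))  # ['3I', '10F']
```

In practice the aligned pair would come from one of the alignment programs cited above; the sketch only shows how differences in an alignment translate into a mutation list.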

In this manner, therefore, the pattern of mutations in the genetic sequence can be
identified, wherein said mutations are associated with resistance or susceptibility to
drug therapy exhibited by the HIV strain tested. The mutation pattern may influence
sensitivity to a specific therapy, e.g., a drug, or a group of therapies. The mutation
pattern may, for example, increase and/or decrease resistance of the HIV strain to a
therapy. Particular mutations in the mutation pattern may also, for example, enhance
and/or decrease the influence of other mutations present in the genetic sequence that
affect sensitivity of the HIV strain to a therapy.

The invention further relates to a diagnostic system as herein described for use in any 
of the above described methods. An example of such a diagnostic system, for
quantitating the individual contribution of a mutation or combination of mutations to
the drug resistance phenotype exhibited by an HIV strain, comprises:

a) means for obtaining a genetic sequence of said HIV strain;

b) means for identifying the mutation pattern in said genetic sequence as compared to
wild type HIV;

c) means for predicting the fold resistance exhibited by the HIV strain using any one of
the methods described above.

The means for predicting the fold resistance are preferably computer means.

A still further aspect of the invention relates to a computer apparatus or computer-based
system adapted to perform any one of the methods of the invention described above, for
example, to quantify the individual contribution of a mutation or combination of
mutations to the drug resistance phenotype exhibited by HIV, or to calculate the
quantitative contribution of a mutation pattern to the drug resistance phenotype
exhibited by an HIV strain.
In a preferred embodiment of the invention, said computer apparatus may comprise a
processor means incorporating a memory means adapted for storing data; means for
inputting data relating to the mutation pattern exhibited by a particular HIV strain; and
computer software means stored in said computer memory that is adapted to perform a
method according to any one of the embodiments of the invention described above and
output a predicted quantified drug resistance phenotype exhibited by an HIV strain.



A computer system of this aspect of the invention may comprise a central processing 
unit; an input device for inputting requests; an output device; a memory; and at least 
one bus connecting the central processing unit, the memory, the input device and the 
output device. The memory should store a module that is configured so that upon 
receiving a request to quantify the individual contribution of a mutation or combination 
of mutations to the drug resistance phenotype exhibited by HIV, or to calculate the
quantitative contribution of a mutation pattern to the drug resistance phenotype
exhibited by an HIV strain, it performs the steps listed in any one of the methods of the
invention described above. 



In the apparatus and systems of these embodiments of the invention, data may be input 
by downloading the sequence data from a local site such as a memory or disk drive, or
alternatively from a remote site accessed over a network such as the internet. The
sequences may be input by keyboard, if required.

The generated results may be output in any convenient format, for example, to a 
printer, a word processing program, a graphics viewing program or to a screen display 
device. Other convenient formats will be apparent to the skilled reader.



The means adapted to quantify the individual contribution of a mutation or combination
of mutations to the drug resistance phenotype exhibited by HIV, or to calculate the
quantitative contribution of a mutation pattern to the drug resistance phenotype
exhibited by an HIV strain, will preferably comprise computer software means. As the
skilled reader will appreciate, once the novel and inventive teaching of the invention is
appreciated, any number of different computer software means may be designed to
implement this teaching.

According to a still further aspect of the invention, there is provided a computer
program product for use in conjunction with a computer, said computer program
comprising a computer readable storage medium and a computer program mechanism
embedded therein, the computer program mechanism comprising a module that is
configured so that upon receiving a request to quantify the individual contribution of a
mutation or combination of mutations to the drug resistance phenotype exhibited by
HIV, or to calculate the quantitative contribution of a mutation pattern to the drug
resistance phenotype exhibited by an HIV strain, it performs the steps listed in any one
of the methods of the invention described above.

The invention further relates to systems, computer program products, business
methods, server side and client side systems and methods for generating, providing, and
transmitting the results of the above methods.

The invention will now be described by way of example with particular reference to a
specific algorithm that implements the process of the invention. As the skilled reader
will appreciate, variations from this specific illustrated embodiment are of course
possible without departing from the scope of the invention.



BRIEF DESCRIPTION OF THE FIGURES

Figure 1: Overview of measured/predicted phenotype handling

Figure 2: Phenotype distribution of RTV for matching G/P samples in Virco database

Figure 3: pFR distribution for SQV

Figure 4a: Distribution of pFR for '48V' mutation on SQV

Figure 4b: Distribution of pFR for '48V' mutation on SQV (expanded)

Figure 5: Removing unwanted correlations

Figure 6a: Global mutation trajectories

Figure 6b: Mutation trajectories for 71I

Figure 7: Example of genotypes, mutations relative to HXB2

Figure 8: Example of phenotype analysis for RTV

Figure 9: Higher order interaction between mutations 82A and 84V

Figure 10: Illustration of iterative procedure for censored values

Figure 11: Linear regression model identifies mutations included in IAS list. Mutations
marked with an * are also identified by a regression on a 5% subset of the data

Figure 12: Linear regression model identifies additional mutations previously described
in the literature

Figure 13: Predicted versus measured log(FC)

Figure 14: Comparison between linear regression model and decision trees.

EXAMPLES
Example 1: Methodology
1.1 Introduction 

This exercise involved the generation of a list of key mutations for each of the
following drugs: Indinavir, Ritonavir, Saquinavir, Nelfinavir, Amprenavir, Lopinavir,
Zidovudine, Didanosine, Zalcitabine, Stavudine, Abacavir, Lamivudine, Tenofovir,
Nevirapine, Delavirdine and Efavirenz.

The obtained list of key mutations is derived from a linear regression model using
single mutations and couples of mutations as independent variables. The dataset used
for this analysis is an export of the Virco dataset at 2003/02/01 from the vircomining
tables. Table 1 shows the matching geno/pheno counts for each drug (each phenotype
measurement for a genotype counts as one measurement).

Table 1: matching geno/pheno counts 
Drug          Count     Drug          Count     Drug          Count
Amprenavir    29,508    Lamivudine    34,395    Delavirdine   32,450
Indinavir     34,445    Abacavir      32,744    Efavirenz     32,601
Lopinavir      7,410    Stavudine     34,420    Nevirapine    34,738
Nelfinavir    34,470    Zalcitabine   34,539
Ritonavir     34,502    Didanosine    34,227
Saquinavir    34,543    Tenofovir     14,591
                        Zidovudine    33,575

1.2 Dataset

The dataset used consists of a set of matching genotype/phenotype measurements with
possibly multiple phenotype measurements per genotype. The mutations are defined
relative to HXB2 at amino acid level. The phenotypes are presented as pFR values,
where pFR is equal to -log(FR) and FR denotes the Fold Resistance.

Negative pFR values denote resistance and positive values denote hyper-susceptibility.
For example, a pFR value of -1.0 is equal to 10-fold resistance.

1.3 Linear regression 

The derived models are based on the current research on the next generation
VirtualPhenotype using linear regression models for phenotype prediction. In linear
models, the phenotype (pFR) prediction is the sum of all individual contributions for 
each of the mutations in the genotype as in the following equation: 

$$\mathrm{pFR} = \beta_{10I} M_{10I} + \beta_{10F} M_{10F} + \dots + \beta_{90M} M_{90M} + \varepsilon$$

The independent variables $M_{10I}$, $M_{10F}$, ..., $M_{90M}$ take the values:

• 1 if the mutation is present

• 0.5 if the mutation is present in a mixture

• 0 if the mutation is not present

Mixtures are modelled as causing the average shift of their constituent mutations, so a
mixture with wild type causes half the pFR shift of the resistance mutation.
The coefficients $\beta_{10I}$, $\beta_{10F}$, ..., $\beta_{90M}$ represent the contribution to the total pFR
prediction for each single mutation.

The error term is $\varepsilon$, which is the difference between the prediction and the
measurement. This error term contains both the measurement error on the phenotype
measurement and a model error (if the underlying model has higher order terms that
are not taken into account in the regression model).

Each $M_i$ represents the presence or absence of the corresponding mutation (absent: 0,
present: 1) and each $\beta_i$ represents the contribution to the pFR change for that specific
mutation.

Linear models can contain second order terms (and in general n-th order terms) of the form:

$$\beta_{10F90M} M_{10F90M}$$

The independent variable $M_{10F90M}$ represents the co-occurrence of mutations 10F and
90M and the coefficient $\beta_{10F90M}$ represents the synergy or antagonism between
mutations 10F and 90M.

For a second order term, the independent variable $M_{10F90M}$ takes the value:

• 1 if both mutations are present and are not in a mixture

• 0 if one of the mutations is not present

• 0.5 if both mutations are present and one mutation is present in a mixture

• 0.25 if both mutations are present in a mixture

Higher order terms account for interactions between mutations:

• reversal or antagonism: positive pFR shift for a mutation couple.

• synergy or enhancement: extra negative pFR shift for a mutation couple.
For example, consider the following (artificial) model: 



Mutation    Coefficient
84V         -0.46
50V         -0.92
54M         -0.64
88S          0.63
90M         -0.16
46L         -0.19
46L&84V     -0.09


Consider a virus with the following mutations: 3I, 46L, 84V and 90M. This virus will have
a pFR prediction:

$$\mathrm{pFR} = \beta_{46L} \cdot 1 + \beta_{84V} \cdot 1 + \beta_{90M} \cdot 1 + \beta_{46L\&84V} \cdot 1 = -0.90$$

or almost 8-fold resistance. Note that in the model 46L and 84V are synergistic since
their co-occurrence decreases the pFR by an extra -0.09.
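
The arithmetic of this worked example can be reproduced with a short Python sketch. It is a minimal illustration, not the inventors' implementation: the function names, the dictionary layout and the "A&B" key format for couples are assumptions, and a base-10 logarithm is assumed for pFR (consistent with a pFR of -1.0 being 10-fold resistance). The couple factor is taken as the product of the single factors, which reproduces the values 1, 0.5 and 0.25 listed for second order terms.

```python
from itertools import combinations

# Coefficients of the artificial model tabulated above.
COEFFICIENTS = {
    "84V": -0.46, "50V": -0.92, "54M": -0.64, "88S": 0.63,
    "90M": -0.16, "46L": -0.19, "46L&84V": -0.09,
}


def predict_pfr(factors, coefficients=COEFFICIENTS):
    """Predict pFR from mutation factors (1.0 present, 0.5 mixture).

    Absent mutations are omitted from `factors`, i.e. they have factor 0.
    Mutations without a coefficient in the model contribute nothing.
    """
    pfr = sum(coefficients.get(m, 0.0) * f for m, f in factors.items())
    for a, b in combinations(factors, 2):
        # Accept either ordering of the couple key (hypothetical "A&B" format).
        beta = coefficients.get(f"{a}&{b}", coefficients.get(f"{b}&{a}", 0.0))
        # Couple factor = product of single factors: 1, 0.5 or 0.25.
        pfr += beta * factors[a] * factors[b]
    return pfr


def fold_resistance(pfr):
    """Convert pFR back to fold resistance, assuming a base-10 logarithm."""
    return 10 ** (-pfr)


virus = {"3I": 1.0, "46L": 1.0, "84V": 1.0, "90M": 1.0}
pfr = predict_pfr(virus)
print(round(pfr, 2), round(fold_resistance(pfr), 1))  # -0.9 7.9
```

Mutation 3I has no term in the artificial model, so it does not change the prediction. Passing `{"46L": 0.5, "84V": 1.0}` would halve both the 46L term and the couple term.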
Figure 7 (table 1 from poster) shows an example of four different genotypes (mutations
relative to HXB2), whilst Figure 8 (table 2 from poster) shows an example of
phenotype analysis for RTV performed according to the method of the invention.

1.4 Model creation

Using our facilities, it was computationally infeasible to calculate a full second order
model on all possible mutations and second order terms, since the number of terms
increases quadratically with the number of mutations considered.
E.g. for APV:

• Total number of occurring mutations and couples of mutations: 19,074

• Mutations and couples with each at least 20 measurements: 4,107

In order to reduce the number of terms, a first order regression was performed from the
list of mutations that occur at least 20 times in the dataset. The significant terms¹ from
this regression were retained and the list of these terms²,³ was used to perform a second
order regression. In the second order regression, only the single mutations that were
significant in the first order model, and couples of those mutations, are used.
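
The counting side of this procedure can be sketched as follows. The sketch only enumerates candidate terms and applies the count threshold of 20 quoted in the text; the regression and the significance test are not shown. The function names, the list-of-lists genotype layout and the "A&B" couple key are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

MIN_COUNT = 20  # occurrence threshold quoted in the text


def candidate_terms(genotypes, min_count=MIN_COUNT):
    """Count single mutations and couples, keeping those seen often enough.

    `genotypes` is an iterable of mutation lists, one per measurement
    (a hypothetical layout chosen for this sketch).
    """
    counts = Counter()
    for genotype in genotypes:
        mutations = sorted(set(genotype))
        counts.update(mutations)
        counts.update(f"{a}&{b}" for a, b in combinations(mutations, 2))
    return {term: n for term, n in counts.items() if n >= min_count}


def full_second_order_size(n_mutations):
    """Number of terms in a full second order model: singles plus couples."""
    return n_mutations + n_mutations * (n_mutations - 1) // 2


# Toy data: 25 measurements carrying 46L and 84V, 5 carrying only 90M.
toy = [["46L", "84V"]] * 25 + [["90M"]] * 5
print(candidate_terms(toy))
print(full_second_order_size(94))  # 4465
```

The second function illustrates the quadratic growth: a starting list of 94 single mutations already yields 4,465 potential terms when every couple is included.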

1.5 Example of model creation: Indinavir

A first order regression is performed on the matching geno/pheno dataset (34,445
measurements, of which 28,480 unique Virco IDs) for those mutations that occur at
least 20 times. The resulting first order model retains a list of 94 single mutations
that are considered significant¹. This list (except 3I) is used as a starting list for a
second order regression. In this second order regression, all the single mutations and all
couples of mutations from the list are used as potential terms. The significant¹ terms are



¹ A term is called significant if the probability that the real value of the term is 0 is smaller than 0.001.
² Except mutation 3I for some of the PIs. The reason is that a mutation must occur at least 20 times and
the inverse also has to be true: the count of viruses not having the mutation 3I, or the couple "not 3I and
another mutation", should also be at least 20. Taking this into account results in excluding 3I from the
regression in practice.

³ For d4T, the number of first order terms was too high to perform a second order analysis. Only the terms
with an absolute value of the coefficient > 0.1 are used in the second order analysis.
1.6 Discussion

Linear regression seeks the model that best fits the underlying data assuming that the
underlying data behaves according to a linear model. In that respect, some aspects of
this technique have to be taken into account when analysing the results from a
regression.

1.6.1 The impact of cross-drug correlation on the significance level of mutations

Correlation between mutations that cause resistance to different drugs has an impact on
the confidence of the coefficient for this mutation. One of the effects is that for
non-nucleoside reverse transcriptase inhibitors (NNRTIs) and nucleoside reverse
transcriptase inhibitors (NRTIs), some non-relevant mutations for that drug appear as
significant (though with a coefficient close to 0), because drug resistance to the drug is
correlated with drug resistance to drugs that bind at a different place.

Note that this is only a problem for interpretation of the model. For prediction of the
Fold Resistance, the resulting model remains a good pFR predictor.

1.6.2 Effects of second order terms
• Example 1: antagonism

Parameter   pFR shift   Count
82A&84V      0.43         395
82A         -0.27        4845
84V         -0.26        3531

Second order terms can indicate a synergy or an antagonism. In the example above, the
occurrence of either 82A or 84V causes a resistance shift, but the co-occurrence of both
mutations almost completely cancels out the effect of both mutations. In case both
mutations are present, the net pFR shift is only -0.10, while it is -0.26 or -0.27 if only
one of the mutations is present. This is an example of (strong) antagonism.
• Example 2: synergy

Parameter   pFR shift   Count
24I         -0.22        1022
24I&73S     -0.48          30
73S         -0.20        2216

In this example, 24I and 73S both cause a resistance shift, but their co-occurrence
causes a strong extra shift towards resistance. When only one of the mutations is
present, the pFR shift is -0.20 or -0.22, but the presence of both mutations causes a pFR
shift of -0.90. 24I and 73S are thus strongly synergistic in this example.
• Example 3: enhancement

Parameter   pFR shift   Count
32I          0            821
32I&82A     -0.26         516
82A         -0.27        4845

32I by itself does not contribute to resistance, but it increases the resistance for an 82A
mutation. 32I enhances the effect of the 82A mutation.
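
The net shifts quoted in the three examples follow from adding the two single terms and the couple term. The following Python sketch checks that arithmetic; the labelling rule in `describe` is a simplification introduced here for illustration and is not a definition taken from the text.

```python
# (first single shift, second single shift, couple shift), from the tables above.
EXAMPLES = {
    "82A & 84V": (-0.27, -0.26, 0.43),
    "24I & 73S": (-0.22, -0.20, -0.48),
    "32I & 82A": (0.0, -0.27, -0.26),
}


def net_shift(shift_a, shift_b, couple_shift):
    """Total pFR shift when both mutations are fully present."""
    return shift_a + shift_b + couple_shift


def describe(shift_a, shift_b, couple_shift):
    """Label the couple term by its sign (illustrative rule only)."""
    if couple_shift > 0:
        return "antagonism"
    if couple_shift < 0:
        if shift_a == 0 or shift_b == 0:
            return "enhancement"
        return "synergy"
    return "no interaction"


for couple, shifts in EXAMPLES.items():
    print(couple, round(net_shift(*shifts), 2), describe(*shifts))
```

For 32I & 82A the net shift is -0.53, roughly twice the shift of 82A alone, which is the enhancement described above.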

An example of the effects of higher order interactions is shown in Figure 9 (figure 3
from poster).

1.6.3 Highly correlated mutations

Highly correlated mutations (i.e. mutations that almost always co-occur in a strain) can
affect the results of a regression analysis. For example, when one of these 2 mutations
has an effect on resistance and the other mutation does not (this mutation might for
example be highly correlated with the resistance mutation because it affects the
replication rate of the virus), the effect can be assigned to either one of the mutations.
Unless this is compensated for, the regression model will assign the effect to that
mutation that reduces the prediction error the most, which might not always be the
mutation that is biologically responsible for the effect. Due to the correlation, it would
otherwise not be possible to distinguish between these mutations.

Another effect that occurs due to correlation is when a mutation is highly correlated
with a pair of mutations in which the first mutation is present.



Parameter   pFR shift   Count
58N         -1.47         108
58N&77L      1.16         106
77L          0            471

In the above example, 108 samples have a 58N mutation and out of these, 106 samples
also have a 77L mutation. The effect of a pure 58N mutation can only be derived from
the 2 samples that have 58N and do not have 77L, which leads to higher uncertainty on
the estimated pFR shift of the 58N mutation. The couple-term '58N & 77L' will
compensate for a too low estimation of 58N by having a too high estimation for its pFR
shift.

Techniques are being developed to deal with these effects. A preferred way to do this is
to use an algorithm developed by the inventors to track the change in pFR as the effects
of individual mutations or combinations of mutations are removed from the dataset.
The effect of each mutation or combination of mutations is thus separated out. The
methodology follows mutation trajectories towards the global average as the effects of
individual mutations or combinations of mutations are removed. The steps are as
follows:

1. Calculate average pFR for all mutations with a sufficient count in the
database to be significant;

2. Determine the extremes (maximum, minimum), and select the
mutation with the pFR furthest away from the global average;

3. Remove all virus strains that have the selected mutation and reiterate
from step 1;

4. Stop when the selected mutation in step 2 has an average pFR that
approximates to the global average.

In this manner, mutations that don't cause resistance, but which are often present with
mutations that do cause resistance, will have a higher average pFR (less resistance).
Removing the virus strains with a certain resistance causing mutation results in an
increase of the average pFR for correlating mutations.

A suitable threshold at which a count in the database becomes significant is around 20
times.
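
The four steps above can be sketched in Python as follows. This is a hedged illustration rather than the inventors' algorithm: the data layout (a list of mutation-set and pFR pairs), the function name and the `tolerance` used to decide when an average "approximates to the global average" are assumptions, since the text does not quantify that stopping rule.

```python
from statistics import mean


def mutation_trajectory(strains, min_count=20, tolerance=0.05):
    """Iteratively remove strains carrying the most extreme mutation.

    `strains` is a list of (set_of_mutations, pFR) pairs. Returns a list of
    (mutation, average pFR, global average) in order of removal.
    `tolerance` is a hypothetical stopping threshold.
    """
    remaining = list(strains)
    removed = []
    while remaining:
        global_avg = mean(pfr for _, pfr in remaining)
        # Step 1: average pFR per mutation with a sufficient count.
        by_mutation = {}
        for mutations, pfr in remaining:
            for mutation in mutations:
                by_mutation.setdefault(mutation, []).append(pfr)
        averages = {m: mean(v) for m, v in by_mutation.items()
                    if len(v) >= min_count}
        if not averages:
            break
        # Step 2: mutation whose average is furthest from the global average.
        selected = max(averages, key=lambda m: abs(averages[m] - global_avg))
        # Step 4: stop once the selected mutation is close to the average.
        if abs(averages[selected] - global_avg) <= tolerance:
            break
        removed.append((selected, averages[selected], global_avg))
        # Step 3: drop every strain that carries the selected mutation.
        remaining = [(m, p) for m, p in remaining if selected not in m]
    return removed


# Toy data: 71I travels with 84V but has no effect of its own.
toy = ([({"84V", "71I"}, -1.0)] * 30 + [({"71I"}, 0.0)] * 30
       + [(set(), 0.0)] * 40)
print(mutation_trajectory(toy))
```

In the toy data the average pFR of 71I is -0.5 before 84V is removed and 0.0 afterwards, so only 84V is reported; this mirrors the jump in the 71I trajectory described for Figure 6b.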

A comparison of the change in the global average pFR with the change in the average 
pFR for selected mutations with increasing iterations of the algorithm is shown in 
Figure 6a. Figure 6b shows an example, where the average pFR for 71I (unwanted
correlation) jumps up as a result of removing from the dataset virus strains that have
"71I & 84V" and "48V" mutations.

An alternative or even additional methodology for removing unwanted correlations is
as follows; this is an extension of the mutation trajectories algorithm discussed above.
The steps of this method are as follows:

1. Calculate correlation coefficient between all mutations (with a sufficient
count in the database) and the pFR;

2. Determine the extremes (maximum, minimum), and select the mutation
with the highest (absolute value of) correlation coefficient;

3. Calculate a linear model for the pFR with the selected mutation(s) (from
step 2, all previous iterations);

4. Take the residue (pFR minus the predicted value from the model);

5. Calculate correlation coefficient between all mutations (with a sufficient
count in the database) and the residue;

6. Determine the extremes (maximum, minimum), and select the mutation
with the highest (absolute value of) correlation coefficient;

7. Calculate a linear model for the pFR with the selected mutation(s) (from
step 6, all previous iterations); and

8. Reiterate from step 4;

9. Stop when the selected mutation in step 7 has a correlation coefficient
that approximates to zero. By approximates to zero is meant that the
correlation coefficient is within a fraction of the standard deviation of
the remaining population. A convenient fraction is about 0.4.

As with the mutation trajectories algorithm described above, the effect of mutations
that do not themselves cause resistance, but which are often present with mutations that
do cause resistance, is excluded and thus does not distort the real values.
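
The residue-based selection above can be sketched in plain Python. Several points are assumptions of this sketch and not taken from the text: the data layout (one column of mutation factors per mutation), the function names, the use of ordinary least squares via the normal equations, and in particular the stopping rule, which is simplified here to a fixed absolute `threshold` on the correlation coefficient instead of the standard-deviation criterion of step 9.

```python
from statistics import mean


def pearson(xs, ys, eps=1e-12):
    """Pearson correlation; 0.0 if either series is (numerically) constant."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx < eps or syy < eps:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / (sxx * syy) ** 0.5


def fit_linear(columns, ys):
    """Least squares with intercept, solved by Gauss-Jordan elimination.

    Assumes the selected columns are not collinear (otherwise a pivot is 0).
    Returns the fitted values for every strain.
    """
    design = [[1.0] + [col[i] for col in columns] for i in range(len(ys))]
    k = len(design[0])
    a = [[sum(r[i] * r[j] for r in design) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(design, ys))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [x - factor * p for x, p in zip(a[r], a[col])]
    coef = [a[i][k] / a[i][i] for i in range(k)]
    return [sum(c * x for c, x in zip(coef, row)) for row in design]


def select_by_residue(factors, pfr, min_count=20, threshold=0.1):
    """Forward selection on the correlation with the current residue."""
    candidates = {m: col for m, col in factors.items()
                  if sum(1 for x in col if x > 0) >= min_count}
    selected, residue = [], list(pfr)
    while len(selected) < len(candidates):
        corr = {m: pearson(col, residue) for m, col in candidates.items()
                if m not in selected}
        best = max(corr, key=lambda m: abs(corr[m]))
        if abs(corr[best]) < threshold:  # simplified stopping rule
            break
        selected.append(best)
        fitted = fit_linear([candidates[m] for m in selected], pfr)
        residue = [y - f for y, f in zip(pfr, fitted)]
    return selected


# Toy data: 84V and 90M shift the pFR; 71I only co-occurs with 84V.
n = 100
factors = {
    "84V": [1.0 if i < 30 else 0.0 for i in range(n)],
    "90M": [1.0 if 50 <= i < 80 else 0.0 for i in range(n)],
    "71I": [1.0 if i < 50 else 0.0 for i in range(n)],
}
pfr = [-1.0 * factors["84V"][i] - 0.5 * factors["90M"][i] for i in range(n)]
print(select_by_residue(factors, pfr))  # ['84V', '90M']
```

In the toy data 71I is correlated with the raw pFR only through 84V, so once 84V and 90M are in the model the residue no longer correlates with 71I and it is not selected.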

1.6.4 Missing second order terms and higher order terms

The regression models were built by first executing a first order regression and then
using the list of significant terms from this regression to build a full second order
model. However, it is possible that a pair of mutations has an effect while the single
mutations do not have an effect by themselves. A first order regression might consider
the single mutations as not significant, so these mutations are not used in the second
order regression. The couple term for the pair of mutations is therefore not in the final
model, while this term could be significant. Future regression models can be tuned to
overcome this limitation. Alternatively, greater computing power will resolve the need
first to perform a first order regression.

The current approach does not involve triplets, quadruples, ... of mutations. It is
technically possible to include these terms as well in a linear regression, but higher
order terms cause a combinatorial explosion of the number of terms in the regression.
This increases the computation time and memory use significantly. Techniques to
select a subset of these higher order terms are currently being developed. Expert
knowledge can also be a source to select a subset of interesting higher order terms.

1.6.5 Censored values

Censored values occur in the genotype/phenotype database (for example, EC50 value =
">1 nM"). These are phenotypes beyond the assay range.

Censored values can be dealt with by attempting to construct a model that is consistent
from extrapolations. Censored values are thus modelled by replacing the censored
value by a maximum likelihood estimation, assuming knowledge of the standard
deviation of the measurement error.

A preferred technique for the generation of a maximum likelihood estimation is as
follows:

• Use value V as if the censor was "=";

• Calculate linear regression model;

• Look at the prediction P from the model:

  ■ P < V - 0.798 (centre of gravity of half Gaussian distribution)

    o Remove value from training data for next iteration

  ■ V - 0.798 < P < V

    o Use V* = V - 0.798

  ■ V < P

    o Use V* = centre of gravity of the tail (<V) of a normal distribution N(P, σ)
      as value for next iteration.

Accordingly, for each iteration, when the prediction and measurement contradict,
censored values are taken into account. When the prediction and measurement are
consistent, censored values are disregarded, on the basis that no further information is
provided and their inclusion has no worth.
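
The three cases above can be written as a single update rule. The sketch below is an interpretation under stated assumptions: the censored pFR is taken to lie below V, the constant 0.798 is read as sqrt(2/pi) in units of the measurement standard deviation `sigma`, and the function name and signature are hypothetical. For V < P the centre of gravity of the tail is computed as the mean of a normal distribution truncated above at V.

```python
import math

HALF_GAUSSIAN_MEAN = math.sqrt(2.0 / math.pi)  # approximately 0.798


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def next_training_value(v, p, sigma=1.0):
    """Value to use for a censored measurement in the next iteration.

    v: censor limit (the true pFR is assumed to lie below v)
    p: prediction of the current regression model
    Returns None when the measurement should be dropped for this iteration.
    """
    offset = HALF_GAUSSIAN_MEAN * sigma
    if p < v - offset:
        return None            # prediction already consistent with the censor
    if p < v:
        return v - offset      # V* = V - 0.798 (in units of sigma)
    # V <= P: mean of N(p, sigma) restricted to values below v.
    z = (v - p) / sigma
    return p - sigma * normal_pdf(z) / normal_cdf(z)


print(next_training_value(-1.0, -2.5))
print(round(next_training_value(-1.0, -1.5), 3))
print(round(next_training_value(-1.0, 0.0), 3))
```

At P = V the truncated mean equals V - 0.798·sigma, so the second and third cases join continuously. For predictions very far above V the tail probability underflows, so a production version would need a guard for that case.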
Figure 10 (figure 4 from poster) illustrates diagrammatically the iterative procedure for
censored values.

Example 2: Results 

In an initial test, genotypes and corresponding phenotypes determined for ritonavir
(RTV) for 28,540 HIV-1 clinical isolates were used. The linear regression analysis
identified 20/22 RTV resistance-associated mutations described in the IAS mutation list
(all except 10F and 77I) (see Figure 11 (table 4 from poster)). Additional mutations
whose effect on RTV susceptibility had been previously described (e.g. 73S/T/C,
84A/C and 88D) were also identified (Figure 12 (table 5 from poster)). Overall,
53 single mutations and 96 pairs of mutations were identified as having a significant
effect on susceptibility to RTV.

The predicted phenotype was compared to the measured phenotype in a leave-one-out
cross-validation, demonstrating a root mean square error of 0.31 (logFR) (see Figure
13 (figure 5 from poster)). The error rate of the linear modelling method [5.62%
(sensitivity=93.0%, specificity=95.4%)] compared favourably to a decision tree-based
model [Beerenwinkel, PNAS 99, (2002) 8271-8276] [10.2% (sensitivity=89.8%,
specificity=89.7%)] (see Figure 14 (table 3 from poster)).

The robustness of the algorithm as a function of the size of the input dataset was 
assessed using smaller subsets of data. Nine of 22 IAS resistance-associated mutations 
for RTV could be identified with subsets > 5% (1600 isolates) of the original data. 
However, the accuracy of the predicted contribution of the mutations improved with 
increasing dataset sizes up to 50% of the original database (median standard error of 
the predicted contributions decreased 50%). Some secondary mutations (e.g. 10R, 32I, 
82S) were identified as having a significant contribution to resistance only when the 
subset size reached a similar 50% level. 

We thus conclude that linear regression modelling is a promising new technique for the 
analysis of drug resistance in HIV-1. It is an attractive tool for identifying primary and 
secondary resistance-associated mutations for new and existing drugs and for 
calculating the contribution of mutations and combinations of mutations to resistance. 
The power of the method is most fully exploited when applied to large datasets of 
matched genotype-phenotype results. 

CLAIMS 

1. A method for quantitating the individual contribution of a mutation or combination 
of mutations to the drug resistance phenotype exhibited by HIV, said method 
comprising the step of performing a linear regression analysis using data from a 
dataset of matching genotypes and phenotypes, 

wherein the log fold resistance, pFR, of each HIV strain is modelled as the sum of 
all the individual resistance contributions for each of the mutations or combinations 
of mutations that occur in HIV according to the following equation: 

pFR = β_A·M_A + β_B·M_B + ... + β_Z·M_Z + ε 

wherein each individual resistance contribution is calculated by multiplying a 
mutation factor, M_A, M_B, ..., M_Z, for each mutation or combination of mutations 
by a resistance coefficient, β_A, β_B, ..., β_Z; 

wherein the mutation factor assigned to each mutation or combination of mutations 
reflects the degree to which that mutation or combination of mutations is present in 
the HIV strain and, if present, to which degree the mutation is present in a mixture; 

wherein each resistance coefficient reflects the contribution of the mutation or 
combination of mutations to the fold resistance exhibited by the strain; 

and wherein the error term, ε, represents the difference between the modelled 
prediction and the experimentally determined measurement. 
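
By way of illustration only, the regression of claim 1 can be sketched as an ordinary least-squares fit of pFR on the mutation factors. The sketch assumes a model without intercept, a mutation factor of 0.5 for a mutation present in a mixture, and a pair factor equal to the product of the single factors; the mutations A and B, the function names and all numerical values are hypothetical.

```python
def least_squares(rows, targets):
    """Solve the normal equations (X^T X) b = X^T y by Gauss-Jordan elimination."""
    k = len(rows[0])
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def predict(coefficients, factors):
    """Modelled pFR: sum of coefficient times mutation factor."""
    return sum(b * m for b, m in zip(coefficients, factors))


# Columns: M_A, M_B, M_AB (co-occurrence of A and B). One row per strain.
factors = [
    [1.0, 0.0, 0.0],   # A only
    [0.0, 1.0, 0.0],   # B only
    [1.0, 1.0, 1.0],   # A and B together
    [0.5, 0.0, 0.0],   # A present in a mixture (assumed factor 0.5)
    [0.0, 0.0, 0.0],   # wild type
]
measured_pfr = [1.0, 0.5, 2.0, 0.5, 0.0]

coefficients = least_squares(factors, measured_pfr)   # beta_A, beta_B, beta_AB
double_mutant = predict(coefficients, [1.0, 1.0, 1.0])
```

With these invented data the fitted pair coefficient is positive, which in the terminology of claim 2 would correspond to synergy between A and B.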

2. A method according to claim 1, wherein for a combination of mutations, the 
mutation factor M_AB represents the co-occurrence of mutations A and B and the 
coefficient β_AB represents the synergy or antagonism between mutations A and B. 

3. A method according to any one of the preceding claims, wherein calculation of the 
resistance coefficient (β_A, β_B, ..., β_Z, β_AB) is performed by evaluating the dataset 
for the drug phenotype reported for each mutation or combination of mutations. 

4. A method according to any one of the preceding claims, wherein correlations are 
removed from the dataset for correlated mutations where not all correlated 
mutations contribute to the drug resistance phenotype, using an algorithm to track 
the change in pFR for each mutation as the effects of individual mutations or 
combinations of mutations are removed from the dataset. 

5. A method according to any one of the preceding claims, wherein the algorithm 
performs the following steps: 

a) calculate average pFR for all mutations with a sufficient count in the 
database to be significant; 

b) determine the extremes (maximum, minimum), and select the mutation with 
the pFR furthest away from the global average; 

c) remove all virus strains that have the selected mutation from the dataset and 
reiterate from step a); 

d) stop when the selected mutation in step b) has an average pFR that 
approximates to the global average; 

such that removing virus strains with a certain resistance causing mutation 
results in an increase of the average pFR for correlating mutations, which thus 
have a higher average pFR. 
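
Steps a) to d) above can be sketched as follows. The sketch makes several assumptions that are not fixed by the claim: the global average is recomputed over the strains remaining at each iteration, "sufficient count" is a simple minimum number of strains, and "approximates to" is a fixed tolerance. Mutations A and B, the function name and all values are hypothetical. The sketch relies only on the distance from the global average, not on any particular sign convention for pFR.

```python
def peel_mutations(strains, min_count=2, tol=0.1):
    """strains: list of (set_of_mutations, pFR).

    Returns the selected mutations in order, each with its average pFR at the
    time of selection.
    """
    remaining = list(strains)
    selected = []
    while remaining:
        global_avg = sum(p for _, p in remaining) / len(remaining)
        # a) average pFR per mutation, keeping only mutations seen often enough
        sums, counts = {}, {}
        for mutations, pfr in remaining:
            for m in mutations:
                sums[m] = sums.get(m, 0.0) + pfr
                counts[m] = counts.get(m, 0) + 1
        averages = {m: sums[m] / counts[m] for m in sums if counts[m] >= min_count}
        if not averages:
            break
        # b) mutation whose average lies furthest from the global average
        best = max(sorted(averages), key=lambda m: abs(averages[m] - global_avg))
        # d) stop once even the most extreme mutation is close to the global average
        if abs(averages[best] - global_avg) <= tol:
            break
        selected.append((best, averages[best]))
        # c) remove every strain carrying the selected mutation, then repeat
        remaining = [(ms, p) for ms, p in remaining if best not in ms]
    return selected


# Invented data: A shifts pFR strongly; B merely co-occurs with A.
strains = [
    ({"A", "B"}, -2.0), ({"A", "B"}, -2.0), ({"A"}, -2.0),
    ({"B"}, 0.0), ({"B"}, 0.0), (set(), 0.0), (set(), 0.0),
]
selected = peel_mutations(strains)
```

In this toy dataset the average pFR of B is -1.0 before the strains carrying A are removed and 0.0 afterwards, so only A is selected.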

6. A method according to any one of claims 1-4, wherein the algorithm performs the 
following steps: 

a) calculate correlation coefficient between all mutations with a sufficient 
count in the database and the pFR; 

b) determine the extremes (maximum, minimum), and select the mutation with 
the highest absolute value of correlation coefficient; 

c) calculate a linear model for the pFR with the selected mutation(s) (from step 
b), all previous iterations); 

d) take the residue; 

e) calculate correlation coefficient between all mutations with a sufficient 
count in the database and the residue; 

f) determine the extremes (maximum, minimum), and select the mutation with 
the highest absolute value of correlation coefficient; 

g) calculate a linear model for the pFR with the selected mutation(s) (from step 
f), all previous iterations); and 

h) reiterate from step d); 

i) stop when the selected mutation in step h) has a correlation coefficient that 
approximates to zero. 
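
The correlation-driven selection of steps a) to i) can be sketched as a forward selection on the residue. The sketch assumes that the linear model includes an intercept (so that the residue is uncorrelated with mutations already selected), that "approximates to zero" means an absolute correlation below a fixed tolerance, and that a residue with negligible variance counts as zero correlation. Mutations A, B and C, the function names and the data are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is (almost) constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx < 1e-12 or syy < 1e-12:
        return 0.0
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / math.sqrt(sxx * syy)


def least_squares(rows, targets):
    """Solve the normal equations by Gauss-Jordan elimination."""
    k = len(rows[0])
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def stepwise_select(columns, pfr, tol=0.05):
    """columns: dict mapping mutation name to its list of mutation factors."""
    selected, coefficients, residue = [], [], list(pfr)
    while True:
        candidates = [m for m in sorted(columns) if m not in selected]
        if not candidates:
            break
        # a)/e) and b)/f): mutation most correlated with the current residue
        best = max(candidates, key=lambda m: abs(pearson(columns[m], residue)))
        # i) stop when that correlation is close to zero
        if abs(pearson(columns[best], residue)) <= tol:
            break
        selected.append(best)
        # c)/g) refit the model (intercept first) with all selected mutations
        rows = [[1.0] + [columns[m][i] for m in selected] for i in range(len(pfr))]
        coefficients = least_squares(rows, pfr)
        # d) take the residue
        residue = [y - sum(b * v for b, v in zip(coefficients, row))
                   for row, y in zip(rows, pfr)]
    return selected, coefficients


# Invented data: pFR = 2*A + 1*B; C co-occurs with A but has no effect of its own.
columns = {
    "A": [1, 1, 1, 1, 0, 0, 0, 0],
    "B": [1, 0, 1, 0, 1, 0, 1, 0],
    "C": [1, 1, 1, 0, 0, 0, 0, 0],
}
pfr = [3.0, 2.0, 3.0, 2.0, 1.0, 0.0, 1.0, 0.0]
selected, coefficients = stepwise_select(columns, pfr)
```

Although C correlates strongly with the raw pFR in this toy dataset, it is not selected because its correlation with the residue vanishes once A and B are in the model.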

7. A method according to any one of the preceding claims, wherein multiple entries of 
the same virus strain or virus strains grown from the same stock solution that cause 
unwanted correlations are removed from the dataset. 

8. A method according to any one of the preceding claims, wherein censored values in 
the genotype / phenotype database are replaced by a maximum likelihood 
estimation. 

9. A method according to claim 8, wherein for each iteration of the linear regression, 
the following steps are performed: 

a) the censored value [> -X] is initially treated as [= -X]; 

b) using this value, a linear regression model for predicted pFR is calculated 
using related values relevant to the pFR of the mutation or combination of 
mutations; 

c) if the calculated model for predicted pFR P is consistent with the censored 
value, the value is ignored in the next iteration; 

d) if the calculated model for predicted pFR P is inconsistent with the censored 
value, the value is used in the next iteration. 

10. A method of identifying a mutation that affects the degree of drug resistance 
exhibited by an HIV strain using a method according to any one of the preceding 
claims. 

11. A method of calculating the quantitative contribution of a mutation pattern to the 
drug resistance phenotype exhibited by an HIV strain, said method comprising the 
steps of: 

a) obtaining a genetic sequence of said HIV strain, 

b) identifying the pattern of mutations in said genetic sequence, wherein said 
mutations are associated with resistance or susceptibility to drug therapy, and 

c) calculating the fold resistance of the HIV strain as compared to the wild type 
HIV strain by performing a linear regression analysis, whereby the log fold 
resistance, pFR, is modelled as the sum of all the individual resistance contributions 
for each of the mutations or combinations of mutations that occur in said HIV strain 
according to the following equation: 

pFR = β_A·M_A + β_B·M_B + ... + β_Z·M_Z + ε 

wherein each individual resistance contribution is calculated by multiplying a 
mutation factor, M_A, M_B, ..., M_Z, for each mutation or combination of mutations 
by a resistance coefficient, β_A, β_B, ..., β_Z; 

wherein the mutation factor assigned reflects the degree to which the mutation or 
combination of mutations is present in the HIV strain and, if present, to which 
degree the mutation is present in a mixture; 

wherein each resistance coefficient reflects the contribution of the mutation or 
combination of mutations to the fold resistance exhibited by the strain; 

and wherein the error term, ε, represents the difference between the modelled 
prediction and the experimentally determined measurement. 

12. A method according to claim 11, which incorporates a method according to any one 
of claims 1-9. 

13. A diagnostic method for optimising a drug therapy in a patient, comprising 
performing a method according to any one of claims 11-12 for each drug or 
combination of drugs being considered to obtain a series of drug resistance 
phenotypes and therefore assess the effect of the plurality of drugs or drug 
combinations on the predicted fold resistance exhibited by the HIV strain with 
which the patient is infected, and selecting the drug or drug combination for which 
the HIV strain is predicted to have the lowest fold resistance. 
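
The selection step of claim 13 can be sketched as a ranking of candidate drugs by predicted pFR. The sketch assumes that a lower predicted pFR corresponds to a lower fold resistance, and that a separate set of resistance coefficients is available for each drug. The drug labels, mutations A and B, the coefficients and the function names are all hypothetical and are not taken from the results above.

```python
def predict_pfr(coefficients, mutation_factors):
    """Modelled pFR: sum of resistance coefficient times mutation factor."""
    return sum(beta * mutation_factors.get(m, 0.0) for m, beta in coefficients.items())


def rank_drugs(models, mutation_factors):
    """Return (drug, predicted pFR) pairs sorted from lowest to highest prediction."""
    predictions = {drug: predict_pfr(c, mutation_factors) for drug, c in models.items()}
    return sorted(predictions.items(), key=lambda item: item[1])


# Hypothetical per-drug coefficients for two mutations A and B.
models = {
    "drug_1": {"A": 1.2, "B": 0.3},
    "drug_2": {"A": 0.1, "B": 0.4},
    "drug_3": {"A": 0.0, "B": 1.5},
}
# Mutation pattern of the patient's strain: A fully present, B present in a mixture.
pattern = {"A": 1.0, "B": 0.5}

ranking = rank_drugs(models, pattern)
best_drug, best_pfr = ranking[0]
```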

14. A method according to any one of claims 11-13, wherein the resistance coefficient 
for each mutation is calculated using a method according to any one of claims 1-9. 

15. Use of a method according to any one of claims 1-9 for assessing the efficiency of a 
patient's therapy or for evaluating or optimizing a therapy. 



16. A diagnostic system for quantitating the individual contribution of a mutation or 
combination of mutations to the drug resistance phenotype exhibited by an HIV 
strain, said system comprising: 

a) means for obtaining a genetic sequence of said HIV strain; 

b) means for identifying the mutation pattern in said genetic sequence as compared 
to wild type HIV; 

c) means for predicting the fold resistance exhibited by the HIV strain using any 
one of the methods of claims 1-14. 

17. A computer apparatus or computer-based system adapted to perform the method of 
any one of the claims 1-14. 

18. A computer program product for use in conjunction with a computer, said computer 
program comprising a computer readable storage medium and a computer program 
mechanism embedded therein, the computer program mechanism comprising a 
module that is configured so that upon receiving a request to quantify the individual 
contribution of a mutation or combination of mutations to the drug resistance 
phenotype exhibited by HIV, or to calculate the quantitative contribution of a 
mutation pattern to the drug resistance phenotype exhibited by an HIV strain, it 
performs a method according to any one of claims 1-14. 

ABSTRACT 

QUANTITATIVE PREDICTION METHOD 

The present invention concerns methods and systems for analysis of drug resistance in 
HIV-1. More specifically, the invention provides methods for predicting drug 
resistance by correlating genotypic information with phenotypic profiles. The methods 
allow the identification of primary and secondary resistance-associated mutations for 
new and existing drugs and the calculation of the contribution of mutations and 
combinations of mutations to resistance and hypersusceptibility. The invention allows 
the design, optimization and assessment of the efficiency of a therapeutic regimen 
based upon the genotype of the disease affecting a patient. 